##  Author: Kiril Boyanov (kirilboyanov [at] gmail.com)
##  LinkedIn: www.linkedin.com/kirilboyanov/
##  Last update: 2023-12-08


In this file, we explore the correlations between happiness and a series of economic, political, societal, environmental and health-related factors. We perform this investigation both by using the most recent data and by looking at historical data so as to see whether the correlations change across time. Additionally, we explore within-country correlations to see whether the strongly correlated factors are mostly the same or different for each country.


Setting things up

Importing relevant packages, defining custom functions, specifying local folders etc.

# Importing relevant packages

# For general data-related tasks
library(plyr)
library(tidyverse)
library(data.table)
library(openxlsx)
library(readxl)
library(arrow)

# For dealing with missing values
library(mice)

# For working with countries
library(countrycode)

# For data visualization
library(ggplot2)
library(plotly)


User input

Throughout the analysis, we will be using a common BaseYear (to represent the past state of happiness) and a common ReferenceYear (to represent the most recent state of happiness). To ensure consistency across files, these two years are stored in a TXT file, which is imported below.

Thus, we use the following years as base and reference:

## Base year:  2005
## Reference year:  2022
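
A minimal sketch of how the two years could be read from a shared text file. The file name and its two-line layout are assumptions made for illustration, as the actual TXT file is not shown here; a temporary file stands in for it.

```r
# Hypothetical layout: first line holds the base year, second the reference year.
# A temporary file stands in for the real TXT file, whose name is not shown.
years_file <- tempfile(fileext = ".txt")
writeLines(c("2005", "2022"), years_file)

# Read both lines and convert them to integers
years <- as.integer(readLines(years_file))
BaseYear <- years[1]
ReferenceYear <- years[2]

cat("Base year: ", BaseYear, "\n")
cat("Reference year: ", ReferenceYear, "\n")
```

Keeping the years in one file means every notebook in the project picks up the same values without hard-coding them.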


Importing & merging data

We import data that was already pre-processed in the WHR_data_prep.Rmd notebook. We draw on many different data sources, previews of which are shown in the following sub-sections.

Happiness data

Country CountryCode Year HappinessScore RowID CountryRank Continent Region
Finland FIN 2022 7.8210 FIN_2022 1 Europe Europe & Central Asia
Denmark DNK 2022 7.6362 DNK_2022 2 Europe Europe & Central Asia
Iceland ISL 2022 7.5575 ISL_2022 3 Europe Europe & Central Asia
Switzerland CHE 2022 7.5116 CHE_2022 4 Europe Europe & Central Asia
Netherlands NLD 2022 7.4149 NLD_2022 5 Europe Europe & Central Asia


Background data

Note that we have several similar measures, e.g. GDP in constant vs. current prices. While they convey essentially the same information, it’s too early to remove them: we first need to see which ones are the strongest correlates before making this decision.

RowID CountryISO3 Country Year P_ControlOfCorruption P_PoliticalStability P_RuleOfLaw P_VoiceAndAccountability P_GovernmentEffectiveness P_CorruptionPerceptionIndex P_ElectoralDemocracyIndex P_FreedomOfExpression P_FreedomOfAssociation P_PopulationPctWithSuffrage P_CleanElectionsIndex P_ElectedOfficialsIndex P_ConflictsFatalityPerCountry S_TotalPopulation S_PopulationAged14OrLess S_PopulationAged15To64 S_PopulationAged65OrMore S_NetMigration S_UrbanPopulation S_UrbanPopPctOfTotal S_TotalHomicideRate S_FemaleHomicideRate S_MaleHomicideRate S_FemaleBankingAccessPctOfPop S_FemaleSchoolDropoutRate S_FemaleManagementPctOfTotal S_LaborParticipRateFemale S_LaborForcePctFemale S_FemaleLiteracyRate S_AccessToCleanFuelsPctOfTotal S_AccessToCleanFuelsPctOfRural S_AccessToCleanFuelsPctOfUrban S_AccessToElectricityPctOfTotal S_AccessToElectricityPctOfRural S_AccessToElectricityPctOfUrban S_CompulsoryEducationYears S_PreprimaryEducationYears S_PrimaryEducationYears S_SecondaryEducationYears S_BScAttainedPopAged25OrMore S_LowerSecEduPctOfTotal S_PrimaryEduPopAged25OrMore S_UpperSecEduPctOfTotal S_MScAttainedPopAged25OrMore E_GDPPerCapitaConstant E_GDPPerCapitaCurrent E_GiniIndex E_PovertyGap215PctOfPop E_PovertyGap365PctOfPop E_PovertyGap685PctOfPop E_PovertyGap215Headcount E_PovertyGap365Headcount E_PovertyGap685Headcount E_PovertyHeadcountPctOfPop E_PovertyHeadcountPctOfAged17OrLess E_ChildPovertyIndex E_PovertyHeadcountPctOfHouseholds E_TotalPovertyIndex E_LaborTaxPctOfProfits E_ProfitTaxPctOfProfits E_TaxOnGoodsServicecPctOfRevenue E_TaxOnGoodsServicecPctOfValAdded E_TotalTaxPctOfProfits E_ImportsPctOfGDP E_ExportsPctOfGDP E_ConsumerPriceInflation E_FemaleUnemployment E_MaleUnemployment E_TotalUnemployment E_TotalUnemploymentLocalEst E_YouthUnemployment E_YouthUnemploymentLocalEst E_HealthExpenditurePctOfGDP E_HealthExpenditurePerCapita E_EducationExpenditurePctOfGDP E_EduExpPerStudentPrimaryPctOfGDP E_EduExpPerStudentSecondPctOfGDP E_EduExpPerStudentTertiaryPctOfGDP E_MilitaryExpenditurePctOfGDP E_ResearchAndDevExpPctOfGDP V_CO2EmissionsKgPerConstantGDP V_CO2EmissionsKgPerGDP V_CO2EmissionsKt V_CO2PerCapita V_AgriculturalLandPctOfTotal V_ArableLandPctOfTotal V_FertilizerUseKgPerHectare V_ForestAreaPctOfTotal V_TotalLandArea V_ProtectedLandAreasPctOfTotal V_TotalProtectedAreas H_TotalSuicideRate H_FemaleSuicideRate H_MaleSuicideRate H_AirPollutionMeanExpPctOfPop H_AirPollutionOverExpPctOfPop H_AccessToEssentialDrugs H_InfantMortalityRate
AFG_1960 AFG Afghanistan 1960 NA NA NA NA NA NA 0.080 0.156 0.106 0.5 0.111 0 NA 8622466 3589290 4788899 244277 2606 724373 8.401 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 7.024793 4.132233 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
AFG_1961 AFG Afghanistan 1961 NA NA NA NA NA NA 0.083 0.165 0.111 0.5 0.112 0 NA 8790140 3665076 4877387 247678 6109 763336 8.684 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 8.097166 4.453443 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 57.87836 11.72899 0.1437908 NA 652230 NA NA NA NA NA NA NA NA NA
AFG_1962 AFG Afghanistan 1962 NA NA NA NA NA NA 0.082 0.165 0.110 0.5 0.112 0 NA 8969047 3746296 4971702 251049 7016 805062 8.976 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 9.349593 4.878051 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 57.95502 11.80565 0.1428571 NA 652230 NA NA NA NA NA NA NA NA NA
AFG_1963 AFG Afghanistan 1963 NA NA NA NA NA NA 0.085 0.172 0.134 0.5 0.105 0 NA 9157465 3835639 5067343 254483 6681 849446 9.276 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 16.863910 9.171601 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 58.03168 11.88231 0.1419355 NA 652230 NA NA NA NA NA NA NA NA NA
AFG_1964 AFG Afghanistan 1964 NA NA NA NA NA NA 0.137 0.241 0.210 1.0 0.130 0 NA 9355514 3934872 5162530 258112 7079 896820 9.586 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 18.055555 8.888893 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 58.11600 11.95897 0.1410256 NA 652230 NA NA NA NA NA NA NA NA NA


Merging happiness data with background data

The background data was already put together in the WHR_data_prep.Rmd notebook, so here we merely need to merge it with the happiness data. At this stage, it’s important to take note of variables that have too many missing values, as this may impact the overall data quality and make some types of analysis unfeasible.

The table below shows all available indicators, their respective categories as well as the percentage of missing values in each column:

Area Indicator PctMissing_AllTime PctMissing_BaseYear PctMissing_RefYear
Political P_ControlOfCorruption 0.0 0.0 0.0
Political P_PoliticalStability 0.0 0.0 0.0
Political P_RuleOfLaw 0.0 0.0 0.0
Political P_VoiceAndAccountability 0.0 0.0 0.0
Political P_GovernmentEffectiveness 0.0 0.0 0.0
Political P_CorruptionPerceptionIndex 0.0 3.7 0.0
Political P_ElectoralDemocracyIndex 0.1 0.0 0.0
Political P_FreedomOfExpression 0.1 0.0 0.0
Political P_FreedomOfAssociation 0.1 0.0 0.0
Political P_PopulationPctWithSuffrage 0.1 0.0 0.0
Political P_CleanElectionsIndex 0.1 0.0 0.0
Political P_ElectedOfficialsIndex 0.1 0.0 0.0
Political P_ConflictsFatalityPerCountry 47.0 55.6 43.4
Societal S_TotalPopulation 0.7 0.0 0.7
Societal S_PopulationAged14OrLess 0.7 0.0 0.7
Societal S_PopulationAged15To64 0.7 0.0 0.7
Societal S_PopulationAged65OrMore 0.7 0.0 0.7
Societal S_NetMigration 0.7 0.0 0.7
Societal S_UrbanPopulation 0.7 0.0 0.7
Societal S_UrbanPopPctOfTotal 0.7 0.0 0.7
Societal S_TotalHomicideRate 10.4 7.4 9.0
Societal S_FemaleHomicideRate 31.0 22.2 25.5
Societal S_MaleHomicideRate 30.9 22.2 26.2
Societal S_FemaleBankingAccessPctOfPop 27.3 100.0 0.7
Societal S_FemaleSchoolDropoutRate 13.5 25.9 4.8
Societal S_FemaleManagementPctOfTotal 35.7 33.3 22.8
Societal S_LaborParticipRateFemale 0.7 0.0 0.7
Societal S_LaborForcePctFemale 0.7 0.0 0.7
Societal S_FemaleLiteracyRate 16.1 44.4 15.9
Societal S_AccessToCleanFuelsPctOfTotal 3.7 3.7 4.1
Societal S_AccessToCleanFuelsPctOfRural 3.7 3.7 4.1
Societal S_AccessToCleanFuelsPctOfUrban 3.7 3.7 4.1
Societal S_AccessToElectricityPctOfTotal 0.7 0.0 0.7
Societal S_AccessToElectricityPctOfRural 0.8 0.0 0.7
Societal S_AccessToElectricityPctOfUrban 0.7 0.0 0.7
Societal S_CompulsoryEducationYears 4.3 0.0 3.4
Societal S_PreprimaryEducationYears 11.5 33.3 2.8
Societal S_PrimaryEducationYears 0.7 0.0 0.7
Societal S_SecondaryEducationYears 0.7 0.0 0.7
Societal S_BScAttainedPopAged25OrMore 44.9 92.6 14.5
Societal S_LowerSecEduPctOfTotal 5.1 11.1 2.8
Societal S_PrimaryEduPopAged25OrMore 11.8 22.2 6.9
Societal S_UpperSecEduPctOfTotal 7.3 18.5 4.1
Societal S_MScAttainedPopAged25OrMore 58.8 100.0 30.3
Economic E_GDPPerCapitaConstant 2.8 3.7 2.1
Economic E_GDPPerCapitaCurrent 1.2 0.0 0.7
Economic E_GiniIndex 8.3 14.8 6.9
Economic E_PovertyGap215PctOfPop 8.3 14.8 6.9
Economic E_PovertyGap365PctOfPop 8.3 14.8 6.9
Economic E_PovertyGap685PctOfPop 8.3 14.8 6.9
Economic E_PovertyGap215Headcount 8.3 14.8 6.9
Economic E_PovertyGap365Headcount 8.3 14.8 6.9
Economic E_PovertyGap685Headcount 8.3 14.8 6.9
Economic E_PovertyHeadcountPctOfPop 69.9 100.0 57.9
Economic E_PovertyHeadcountPctOfAged17OrLess 75.5 100.0 65.5
Economic E_ChildPovertyIndex 97.7 100.0 94.5
Economic E_PovertyHeadcountPctOfHouseholds 93.8 100.0 89.0
Economic E_TotalPovertyIndex 90.2 100.0 82.1
Economic E_LaborTaxPctOfProfits 4.9 14.8 1.4
Economic E_ProfitTaxPctOfProfits 4.9 14.8 1.4
Economic E_TaxOnGoodsServicecPctOfRevenue 15.0 11.1 11.7
Economic E_TaxOnGoodsServicecPctOfValAdded 16.7 18.5 12.4
Economic E_TotalTaxPctOfProfits 4.9 14.8 1.4
Economic E_ImportsPctOfGDP 2.3 0.0 2.1
Economic E_ExportsPctOfGDP 2.3 0.0 2.1
Economic E_ConsumerPriceInflation 2.8 7.4 2.1
Economic E_FemaleUnemployment 2.1 0.0 1.4
Economic E_MaleUnemployment 0.7 0.0 0.7
Economic E_TotalUnemployment 0.7 0.0 0.7
Economic E_TotalUnemploymentLocalEst 1.3 0.0 0.7
Economic E_YouthUnemployment 0.7 0.0 0.7
Economic E_YouthUnemploymentLocalEst 6.5 7.4 1.4
Economic E_HealthExpenditurePctOfGDP 2.5 0.0 2.1
Economic E_HealthExpenditurePerCapita 2.5 0.0 2.1
Economic E_EducationExpenditurePctOfGDP 3.1 0.0 2.1
Economic E_EduExpPerStudentPrimaryPctOfGDP 18.9 37.0 12.4
Economic E_EduExpPerStudentSecondPctOfGDP 20.5 33.3 13.1
Economic E_EduExpPerStudentTertiaryPctOfGDP 16.1 18.5 10.3
Economic E_MilitaryExpenditurePctOfGDP 3.7 0.0 4.1
Economic E_ResearchAndDevExpPctOfGDP 15.7 3.7 12.4
Environmental V_CO2EmissionsKgPerConstantGDP 4.0 3.7 3.4
Environmental V_CO2EmissionsKgPerGDP 2.5 0.0 2.1
Environmental V_CO2EmissionsKt 1.9 0.0 2.1
Environmental V_CO2PerCapita 1.9 0.0 2.1
Environmental V_AgriculturalLandPctOfTotal 0.7 0.0 0.7
Environmental V_ArableLandPctOfTotal 0.7 0.0 0.7
Environmental V_FertilizerUseKgPerHectare 0.7 0.0 0.7
Environmental V_ForestAreaPctOfTotal 1.3 0.0 1.4
Environmental V_TotalLandArea 0.7 0.0 0.7
Environmental V_ProtectedLandAreasPctOfTotal 57.5 100.0 0.7
Environmental V_TotalProtectedAreas 87.4 85.2 88.3
Health-related H_TotalSuicideRate 1.9 0.0 2.1
Health-related H_FemaleSuicideRate 1.9 0.0 2.1
Health-related H_MaleSuicideRate 1.9 0.0 2.1
Health-related H_AirPollutionMeanExpPctOfPop 1.3 0.0 1.4
Health-related H_AirPollutionOverExpPctOfPop 1.3 0.0 1.4
Health-related H_AccessToEssentialDrugs 99.0 100.0 99.3
Health-related H_InfantMortalityRate 92.0 92.6 91.7
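
The merge and the missing-value summary can be sketched roughly as follows. This is an illustration on made-up toy data, not the notebook's actual code: joining on RowID with a left join, the country codes AAA/BBB and the helper `pct_missing` are all assumptions.

```r
# Toy stand-ins for the happiness and background tables (values are made up)
happiness <- data.frame(
  RowID = c("AAA_2005", "AAA_2022", "BBB_2005", "BBB_2022"),
  Year = c(2005, 2022, 2005, 2022),
  HappinessScore = c(6.1, 6.4, 4.9, 5.2)
)
background <- data.frame(
  RowID = c("AAA_2005", "AAA_2022", "BBB_2005", "BBB_2022"),
  E_GiniIndex = c(NA, 30.1, NA, 41.5),
  S_TotalPopulation = c(1e6, 1.1e6, 5e6, NA)
)

# Assumed left join: keep every happiness row, attach indicators where available
merged <- merge(happiness, background, by = "RowID", all.x = TRUE)

# Hypothetical helper: percentage of missing values per column, one decimal
pct_missing <- function(df, cols) {
  round(colMeans(is.na(df[, cols, drop = FALSE])) * 100, 1)
}

indicators <- c("E_GiniIndex", "S_TotalPopulation")
missing_summary <- data.frame(
  Indicator = indicators,
  PctMissing_AllTime = pct_missing(merged, indicators),
  PctMissing_BaseYear = pct_missing(merged[merged$Year == 2005, ], indicators),
  PctMissing_RefYear = pct_missing(merged[merged$Year == 2022, ], indicators),
  row.names = NULL
)
print(missing_summary)
```

Computing the share separately for the full history, the base year and the reference year mirrors the three PctMissing columns in the table above.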


Dealing with missing data

As we saw in the table with the various indicators above, we do have a rather elevated share of missing data for some columns. This makes these columns rather unusable for analytic purposes if left untreated. Therefore, we need to have a strategy for dealing with missing data.


Exploring data availability across time

As the chart below shows, the availability of the data varies across time, so it may not be wise to apply the same strategy to every period:

Specifically, we can see that data is missing far more often in the BaseYear than in the ReferenceYear, with the rest of the historical data lying in between. This indicates that data quality has improved over time.


Differentiating the approach to dealing with missing data

In this and the subsequent section, we work with the following three datasets:

  • DataForAnalysis, containing the entirety of the historical data

  • DataForBaseYear, containing only data from the base year we’ve selected to denote the past state of happiness

  • DataForReferenceYear, containing only data from the reference year we’ve selected to denote the current state of happiness


Automatically dropping columns with too many missing values

As a rule, we entirely remove columns that contain more than 10% missing data. This is done separately for each of the three datasets. A summary of the number of columns before and after the removal of the problematic variables is printed out below:

Dataset ColumnsBefore ColumnsAfter ColumnsDropped PctDropped
All time 106 79 27 25.5
Base year 106 69 37 34.9
Reference year 106 85 21 19.8
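
A minimal sketch of the 10% rule, applied to toy data. The function name `drop_sparse_columns` and the toy columns are hypothetical; only the threshold comes from the text above.

```r
# Drop every column whose share of missing values exceeds the threshold
# (hypothetical helper; the 10% default follows the rule stated in the text)
drop_sparse_columns <- function(df, max_pct_missing = 10) {
  pct_missing <- colMeans(is.na(df)) * 100
  df[, pct_missing <= max_pct_missing, drop = FALSE]
}

# Toy data: 20 rows, with 0%, 5% and 50% missing values respectively
toy <- data.frame(
  A = 1:20,
  B = c(NA, 2:20),
  C = c(rep(NA, 10), 11:20)
)

kept <- drop_sparse_columns(toy)

# Same layout as the summary table: columns before, after, dropped, % dropped
summary_row <- data.frame(
  ColumnsBefore = ncol(toy),
  ColumnsAfter = ncol(kept),
  ColumnsDropped = ncol(toy) - ncol(kept),
  PctDropped = round((ncol(toy) - ncol(kept)) / ncol(toy) * 100, 1)
)
print(summary_row)
```

Running the same function on each of the three datasets would yield a different set of surviving columns per dataset, which is consistent with the differing counts in the table.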


Imputing missing data

Method description

The next part of the process is slightly trickier, as we have many different columns which may be very different in nature. Here, we use the missForest package to impute all missing values in all variables. This is not done on a per-country basis, as certain countries have no data for some fields, which makes it impossible to create meaningful imputations. By doing the imputation at the global level, we’re able to benefit from cross-country pattern recognition and from discovering inter-dependencies between variables.

The technique used here is based on a series of automatically defined and fitted random forest (RF) models (each model is cross-validated 5 times). The advantages of this method are that it makes no assumptions about the distribution of the observations and that we can easily get summary statistics on how accurate the imputed values are likely to be.

To further improve the reliability of the imputation, we create what is essentially a series of 10 datasets with different imputed values. By doing so, we account for the randomness of the missing observations as we can later on combine these datasets into one and run our subsequent analysis on all of them. This will increase the number of observations but in a proportional manner, so the original values will matter in exactly the same way as they did before the imputation was performed. Please note that these duplicates are dealt with further down the line.


Data preparation

To prevent errors from affecting our imputation, we remove countries which have fewer than 5 observations in the all-time historical data (otherwise, the RF models will not be able to produce any results). An overview of the countries we’re removing from our analysis is printed below:

Country CountryCode NumberOfRows
Angola AGO 4
Belize BLZ 2
Bhutan BTN 3
Congo COG 1
Cuba CUB 1
Djibouti DJI 4
Eswatini SWZ 3
Eswatini, Kingdom of SWZ 1
Gambia GMB 4
Guyana GUY 1
Maldives MDV 1
Oman OMN 1
Somalia SOM 3
South Sudan SSD 4
Suriname SUR 1
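
The filtering step could look roughly like this in base R. The toy panel, the country codes and the variable names are made up for illustration; only the threshold of 5 observations comes from the text.

```r
# Toy panel: country AAA has 6 yearly rows, country BBB only 2
panel <- data.frame(
  CountryCode = c(rep("AAA", 6), rep("BBB", 2)),
  Year = c(2015:2020, 2019:2020)
)
min_rows <- 5

# Count rows per country
counts <- as.data.frame(
  table(CountryCode = panel$CountryCode),
  responseName = "NumberOfRows",
  stringsAsFactors = FALSE
)

# Countries below the threshold, listed for reporting
removed <- counts[counts$NumberOfRows < min_rows, ]
print(removed)

# Keep only countries with enough observations
panel_kept <- panel[!panel$CountryCode %in% removed$CountryCode, ]
```

Counting by country code rather than by country name would also avoid splitting one country across two spellings of its name.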


Data imputation

Please note that the process of imputing data via RF models may take some time to complete, as it can be computationally intensive. Errors may be generated in some cases where the data for certain countries/indicators looks odd (e.g. not enough predictors/observations to fit the RF models).


Dealing with the artificially introduced duplicate rows

To undo the artificial increase in the number of observations caused by the random forest imputation technique, we group all rows by year and country and then use the mean value for each of the indicators. If any missing values remain after this step, we group the rows by year only and then use that average (this second part only concerns fewer than 70 rows, or less than 3.5% of all rows, and applies to entities not universally recognized as independent countries such as Hong Kong and the Palestinian Authority).
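
The two-step averaging can be sketched as follows. This is a base R illustration on toy data with a single indicator and only 2 imputed copies per country-year; the column names and values are assumptions, not the notebook's actual code.

```r
# Toy stacked output of 2 imputed datasets; each country-year appears
# once per imputed copy (values are made up)
stacked <- data.frame(
  CountryCode = c("AAA", "AAA", "BBB", "BBB"),
  Year = c(2020, 2020, 2020, 2020),
  E_GiniIndex = c(30, 32, NA, NA)
)

# Step 1: one row per country-year, using the mean across imputed copies.
# na.action = na.pass keeps groups whose values are all missing (mean is NaN).
collapsed <- aggregate(
  E_GiniIndex ~ CountryCode + Year,
  data = stacked, FUN = mean, na.rm = TRUE, na.action = na.pass
)

# Step 2: where a value is still missing, fall back to the mean of that year
year_mean <- ave(
  collapsed$E_GiniIndex, collapsed$Year,
  FUN = function(x) mean(x, na.rm = TRUE)
)
still_missing <- is.na(collapsed$E_GiniIndex)
collapsed$E_GiniIndex[still_missing] <- year_mean[still_missing]

print(collapsed)
```

With many indicators, the same two steps would be repeated for each indicator column.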

As a final check, we once again see whether we have any missing values after this step of the process:



Performing some quality checks after the imputation

Before continuing, we check whether we have any missing countries in the dataset containing the imputed observations:

## [1] "No countries were found to be missing in the dataset containing the imputed values."

Furthermore, we check whether there are any persisting missing values even after the imputation (this shouldn’t be the case):

## [1] "There are no missing values in the data frame containing the imputations."
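
Both checks can be expressed in a few lines of base R. The sketch below uses toy data and hypothetical object names (`codes_before`, `imputed`); the printed messages follow the output shown above.

```r
# Toy stand-ins: country codes before imputation and the imputed data frame
codes_before <- c("AAA", "BBB", "CCC")
imputed <- data.frame(
  CountryCode = c("AAA", "BBB", "CCC"),
  E_GiniIndex = c(30.5, 41.2, 35.0)
)

# Check 1: countries present before the imputation but absent afterwards
missing_countries <- setdiff(codes_before, imputed$CountryCode)
if (length(missing_countries) == 0) {
  print("No countries were found to be missing in the dataset containing the imputed values.")
} else {
  print(paste("Missing countries:", paste(missing_countries, collapse = ", ")))
}

# Check 2: count any missing values that survived the imputation
remaining_na <- sum(is.na(imputed))
if (remaining_na == 0) {
  print("There are no missing values in the data frame containing the imputations.")
} else {
  print(paste("Remaining missing values:", remaining_na))
}
```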


Final adjustments

We applied the imputation to the all-time historical data, but we now need to split the data so that we have separate datasets for the base and the reference years. We need to do this to stay true to our original methodology, in which we completely excluded columns that had more than 10% missing data.

After these adjustments are applied, we export all three datasets so that the analysis can continue in another notebook. With this, we’re finally ready to start modelling the data!